start state
e8da56eb93676e8f60ed2b696e44e7dc-Supplemental-Conference.pdf
The goal location is small region around (20,20). In each task, S0 was a set of arm con gurations establishing contact with the 539 end-effector, the 6-DoF change in stiffness, and 1-DoF gripper state. The fraction of start states in S0 that lead to success 557 IVF, classi er). The result of that execution is recorded as 552 Algorithm 1 is the pseudocode used for the experiments described in Section 4.1. Episodes last a maximum of 1000 steps.
Autonomous Curriculum Design via Relative Entropy Based Task Modifications
Satici, Muhammed Yusuf, Wang, Jianxun, Roberts, David L.
Curriculum learning is a training method in which an agent is first trained on a curriculum of relatively simple tasks related to a target task in an effort to shorten the time required to train on the target task. Autonomous curriculum design involves the design of such curriculum with no reliance on human knowledge and/or expertise. Finding an efficient and effective way of autonomously designing curricula remains an open problem. We propose a novel approach for automatically designing curricula by leveraging the learner's uncertainty to select curricula tasks. Our approach measures the uncertainty in the learner's policy using relative entropy, and guides the agent to states of high uncertainty to facilitate learning. Our algorithm supports the generation of autonomous curricula in a self-assessed manner by leveraging the learner's past and current policies but it also allows the use of teacher guided design in an instructive setting. We provide theoretical guarantees for the convergence of our algorithm using two time-scale optimization processes. Results show that our algorithm outperforms randomly generated curriculum, and learning directly on the target task as well as the curriculum-learning criteria existing in literature. We also present two additional heuristic distance measures that could be combined with our relative-entropy approach for further performance improvements.
Layered LA-MAPF: a decomposition of large agent MAPF instance to accelerate solving without compromising solvability
Multi-Agent Path Finding (MAPF) has been widely studied in recent years. However, most existing MAPF algorithms assume that an agent occupies only a single grid in a grid-based map. This assumption limits their applicability in many real-world domains where agents have geometric shapes, rather than being point-like. Such agents, which can occupy multiple cells simultaneously, are referred to as ``large'' agents. When considering the shape and size of agents in MAPF, the computational complexity increases significantly as the number of agents grows, primarily due to the increased overhead in conflict detection between geometric agents. In this paper, we propose two types of subproblems for the LA-MAPF (Large-Agent MAPF) problem: \textbf{cluster} (which has no constraints on the order of solution) and \textbf{level} (which imposes constraints on the solution order). We introduce \textbf{Layered LA-MAPF}, a method that decomposes a MAPF instance involving geometric agents into clusters, and then further decomposes each cluster into levels. This approach aims to reduce time complexity when solving LA-MAPF problems. Our results demonstrate the performance of our method as the number of agents increases across various maps, and how it accelerates LA-MAPF methods, such as LA-CBS and LA-LaCAM. Experiments show that our LA-MAPF method with instance decomposition \textbf{halves the time cost (reducing from an average of 40s to 20s) and triples the success rate (from an average of 0.27 to 0.80)} in finding a solution within 60 seconds. To facilitate further research, we have made the source code for Layered LA-MAPF publicly available at \url{https://github.com/JoeYao-bit/LayeredMAPF/algorithm/LA-MAPF}.
Causal Contextual Bandits with Adaptive Context
Madhavan, Rahul, Maiti, Aurghya, Sinha, Gaurav, Barman, Siddharth
We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner. At the beginning of each round, the learner selects an initial action, depending on which a stochastic context is revealed by the environment. Following this, the learner then selects a final action and receives a reward. Given $T$ rounds of interactions with the environment, the objective of the learner is to learn a policy (of selecting the initial and the final action) with maximum expected reward. In this paper we study the specific situation where every action corresponds to intervening on a node in some known causal graph. We extend prior work from the deterministic context setting to obtain simple regret minimization guarantees. This is achieved through an instance-dependent causal parameter, $\lambda$, which characterizes our upper bound. Furthermore, we prove that our simple regret is essentially tight for a large class of instances. A key feature of our work is that we use convex optimization to address the bandit exploration problem. We also conduct experiments to validate our theoretical results, and release our code at our project GitHub repository: https://github.com/adaptiveContextualCausalBandits/aCCB.
Model-based Offline Quantum Reinforcement Learning
Eisenmann, Simon, Hein, Daniel, Udluft, Steffen, Runkler, Thomas A.
This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.
A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure satisfying the DAG constraint in the second stage. Although this approach is effective in a lowdimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting. In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions. We demonstrate our approach on data simulated from benchmark Bayesian networks and real data.
Safe Mission-Level Path Planning for Exploration of Lunar Shadowed Regions by a Solar-Powered Rover
Lamarre, Olivier, Malhotra, Shantanu, Kelly, Jonathan
Exploration of the lunar south pole with a solar-powered rover is challenging due to the highly dynamic solar illumination conditions and the presence of permanently shadowed regions (PSRs). In turn, careful planning in space and time is essential. Mission-level path planning is a global, spatiotemporal paradigm that addresses this challenge, taking into account rover resources and mission requirements. However, existing approaches do not proactively account for random disturbances, such as recurring faults, that may temporarily delay rover traverse progress. In this paper, we formulate a chance-constrained mission-level planning problem for the exploration of PSRs by a solar-powered rover affected by random faults. The objective is to find a policy that visits as many waypoints of scientific interest as possible while respecting an upper bound on the probability of mission failure. Our approach assumes that faults occur randomly, but at a known, constant average rate. Each fault is resolved within a fixed time, simulating the recovery period of an autonomous system or the time required for a team of human operators to intervene. Unlike solutions based upon dynamic programming alone, our method breaks the chance-constrained optimization problem into smaller offline and online subtasks to make the problem computationally tractable. Specifically, our solution combines existing mission-level path planning techniques with a stochastic reachability analysis component. We find mission plans that remain within reach of safety throughout large state spaces. To empirically validate our algorithm, we simulate mission scenarios using orbital terrain and illumination maps of Cabeus Crater. Results from simulations of multi-day, long-range drives in the LCROSS impact region are also presented.
Recovery Policies for Safe Exploration of Lunar Permanently Shadowed Regions by a Solar-Powered Rover
Lamarre, Olivier, Malhotra, Shantanu, Kelly, Jonathan
The success of a multi-kilometre drive by a solar-powered rover at the lunar south pole depends upon careful planning in space and time due to highly dynamic solar illumination conditions. An additional challenge is that the rover may be subject to random faults that can temporarily delay long-range traverses. The majority of existing global spatiotemporal planners assume a deterministic rover-environment model and do not account for random faults. In this paper, we consider a random fault profile with a known, average spatial fault rate. We introduce a methodology to compute recovery policies that maximize the probability of survival of a solar-powered rover from different start states. A recovery policy defines a set of recourse actions to reach a safe location with sufficient battery energy remaining, given the local solar illumination conditions. We solve a stochastic reach-avoid problem using dynamic programming to find an optimal recovery policy. Our focus, in part, is on the implications of state space discretization, which is required in practical implementations. We propose a modified dynamic programming algorithm that conservatively accounts for approximation errors. To demonstrate the benefits of our approach, we compare against existing methods in scenarios where a solar-powered rover seeks to safely exit from permanently shadowed regions in the Cabeus area at the lunar south pole. We also highlight the relevance of our methodology for mission formulation and trade safety analysis by comparing different rover mobility models in simulated recovery drives from the LCROSS impact region.
Learning Search-Space Specific Heuristics Using Neural Networks
Liu, Yu, Kuroiwa, Ryo, Fukunaga, Alex
We propose and evaluate a system which learns a neuralnetwork heuristic function for forward search-based, satisficing classical planning. Our system learns distance-to-goal estimators from scratch, given a single PDDL training instance. Training data is generated by backward regression search or by backward search from given or guessed goal states. In domains such as the 24-puzzle where all instances share the same search space, such heuristics can also be reused across all instances in the domain. We show that this relatively simple system can perform surprisingly well, sometimes competitive with well-known domain-independent heuristics.
Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits
Ghosh, Abheek, Nagaraj, Dheeraj, Jain, Manish, Tambe, Milind
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.